class: center, middle, inverse, title-slide

.title[
# Data Science Applications: Lecture 7
]
.subtitle[
## Introduction to Machine Learning and Logistic Regression: Part 1 of 2
]
.author[
### Josemari Feliciano
]
.institute[
### DATA 312 - American University
]
.date[
### Spring 2025 - March 18
]

---

## Recap from last lecture

We spent most of last lecture doing map-making exercises (e.g., tract-, county-, and state-level maps). You've had plenty of practice via homework and the midterm.

<br/>

Speaking of which ...

<style type="text/css">
.tiny .remark-code { /*Change made here*/
  font-size: 70% !important;
}
.extra-tiny .remark-code { /*Change made here*/
  font-size: 50% !important;
}
</style>

---

## Administrative: First midterm

I am aiming to grade the first midterm before next class (March 25th).

<br/>

__Brief observation:__ Many of you finished within 90 minutes, which was my target length for the exam, even though I gave you plenty of extra time.

---

## Goals Today

- To go over key introductory data science and machine learning concepts.
- Introduction to logistic regression as a statistical model.

<br/>

Next week, we will go over logistic regression as a "classifier" tool.

---

## Why is Data Science Important?

__What is Data Science?__

Data Science is an interdisciplinary field that extracts insights from data using statistics, programming, and domain expertise.

__May involve the following:__

- Problem Identification (Define the question)
- Data Collection (Gather relevant data)
- Data Cleaning (Handle missing values, format inconsistencies)
- Exploratory Data Analysis (EDA) (Understand distributions, trends)
- Modeling & Analysis (Statistical and machine learning techniques)
- Interpretation & Communication (Insights and storytelling with data)

---

## Why is Data Science Important?
<br/>

__Real-world Applications:__

- Public health (disease prediction, outbreak tracking)
- Business (customer insights, marketing optimization)
- Government (policy analysis, fraud detection)
- Tech (AI, recommendation systems)

---

## Data Types: Structured vs Unstructured Data

__Structured Data:__ Structured data is highly organized and easily accessible data, typically stored in tables or databases with a clear, predefined format.

__Key Characteristics:__

- Typically organized in rows and columns (e.g., relational databases).
- Uses a defined schema (e.g., SQL, CSV).
- Easily searchable and analyzed using traditional tools (e.g., SQL queries).
- __Examples:__ Customer data in a CRM (Name, Age, Email), financial records (transaction amounts, dates).

__Advantages:__

- Easy to process and analyze.
- Efficient querying and reporting.
- Highly compatible with business intelligence tools (e.g., PowerBI, Tableau).

---

## Data Types: Structured vs Unstructured Data

__Unstructured Data:__ Unstructured data is raw, unorganized, and lacks a specific format or structure, making it harder to store, process, and analyze.

__Key Characteristics:__

- No predefined schema or organization.
- Typically stored in formats like text, images, videos, audio, etc.
- Difficult to search or analyze without advanced tools (e.g., machine learning, NLP).
- __Examples:__ Social media posts; emails, PDFs, and Word documents; and video or audio files.

__Advantages:__

- Can contain valuable insights (e.g., sentiments, trends).
- Larger variety of data types (text, multimedia).
- May offer deeper context when analyzed using advanced methods (e.g., AI, network analysis).

---

### Example work using unstructured data:

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#images/networks.png" alt="Figure 1. Network analysis plot I created to map the flow of brain tumor information on X, previously Twitter. Social media data was cleaned and preprocessed using R; I then used Gephi (open-source software) to create the image." width="40%" />
<p class="caption">Figure 1. Network analysis plot I created to map the flow of brain tumor information on X, previously Twitter. Social media data was cleaned and preprocessed using R; I then used Gephi (open-source software) to create the image.</p>
</div>

__Note:__ Image from a poster presentation found in https://elischolar.library.yale.edu/cgi/viewcontent.cgi?article=1125&context=dayofdata

---

## Labels: Understanding Labeled and Unlabeled Data

__What is Labeled Data?__ Data where each observation has explicit labels or annotations (e.g., outcomes or key classifications that you may want to predict).

Example: Medical images labeled as "cancerous" or "non-cancerous". A loan database where each loan has a column for "fraudulent" or "non-fraudulent".

__What is Unlabeled Data?__ Data without explicit labels (e.g., outcomes)—only raw observations.

Example: Text data from a PDF textbook, customer purchase history, social media data.

---

## Introduction to Machine Learning: Supervised vs Unsupervised

Machine learning algorithms can be broadly categorized into supervised and unsupervised learning.

__Supervised Learning:__ Supervised learning is a type of machine learning where a model is trained on labeled data—each input has a corresponding correct output. The goal is for the model to learn the relationship between inputs and outputs to make predictions on new data.

Typically used for classification (e.g., spam detection) and regression (e.g., predicting house prices).

Examples: Email filtering (spam vs. not spam), fraud detection in financial transactions, medical diagnosis (disease detection from medical scans).

---

## Introduction to Machine Learning: Supervised vs Unsupervised
__Unsupervised Learning:__ Unsupervised learning is a type of machine learning where a model is trained on unlabeled data, meaning the algorithm finds patterns and structures in the data without predefined outputs.

Uses unlabeled data (no explicit categories or labels).

Typically used for clustering (grouping similar data points) and dimensionality reduction (simplifying complex data).

---

## An example of unsupervised learning:

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#images/pca.png" alt="Figure 2. An example of what is called Principal Component Analysis (PCA) using data from the palmerpenguins package. Let us briefly discuss this image." width="70%" />
<p class="caption">Figure 2. An example of what is called Principal Component Analysis (PCA) using data from the palmerpenguins package. Let us briefly discuss this image.</p>
</div>

---

## Hypothesis testing and statistical models

Hypothesis testing is a statistical method used to make inferences or draw conclusions about a larger population based on sample data. It helps determine if an observed effect is due to chance or a real difference.

__Null Hypothesis__: No effect or no difference ("status quo"). Example: A new drug has no effect on blood pressure.

__Alternative Hypothesis__: There is an effect or difference. Example: A new drug lowers blood pressure.

---

## Steps in Hypothesis Testing (Formal)

- __Define the Hypotheses__ Identify the null hypothesis and alternative hypothesis based on the research question (e.g., a new drug has no effect vs has an effect on blood pressure).
- __Set the Significance Level (α)__ Nearly always 0.05.
- __Choose a Statistical Test__ Based on data type and study design (e.g., t-test, chi-square test, __logistic regression__).
- __Compute the Test Statistic & P-value__ The test statistic measures how extreme the sample result is. A P-value is generated, which indicates the probability of obtaining results as extreme as the observed, assuming the null hypothesis is true.
- __Draw a Conclusion__ If P-value ≤ α: Reject H₀ (support for the alternative hypothesis). If P-value > α: Fail to reject H₀ (insufficient evidence).

In practice, most people simply run a statistical model and see if the P-value of the model or variable is less than 0.05. __I will simplify this further with a lot of context after we run logistic regression models.__

---

class: center, middle

Now, let us go over the basics of logistic regression.

---

## Binary Logistic Regression

Binary logistic regression (typically referred to as "logistic regression") is a statistical model or machine learning technique typically used to analyze data with binary outcomes (e.g., cancer vs not cancer, spam vs not spam).

Binary logistic regression is technically a special case of the __generalized linear model__ (often abbreviated GLM).

We typically use binary logistic regression to:

- model the odds of a response variable as a function of some explanatory variables, e.g., getting breast cancer as a function of age or menopausal status.
- classify individuals into two categories based on explanatory variables, e.g., classify new emails into "spam" or "not spam" groups based on the inclusion of suspicious hyperlinks.

---

## Logistic Regression: A Quick Note

Binary logistic regression (typically referred to as "logistic regression") is normally what is taught in regression and data science courses. This is the focus of our lectures 7 and 8. But there are in fact other types of logistic regression:

__Multinomial logistic regression__ is used when the goal is to estimate the relationship between a nominal dependent variable with three or more __unordered (nominal)__ outcomes, and one or more independent variables/predictors.

For example, a biologist may be interested in three food choices that alligators make. Adult alligators might have different preferences from young ones. The outcome variable here will be the type of food (e.g., chicken, fish, mammals), and the predictor variable might be the size of the alligators.

---

## Logistic Regression: A Quick Note

Binary logistic regression (typically referred to as "logistic regression") is normally what is taught in regression and data science courses. This is the focus of our lectures 7 and 8. But there are in fact other types of logistic regression:

__Ordinal logistic regression__ is used when the goal is to estimate the relationship between an ordinal dependent variable with three or more __ordered__ outcomes, and one or more independent variables/predictors.

For example, a marketing research firm wants to investigate what factors influence the size of soda (small, medium, large, or extra large) that people order at a fast-food chain. Predictor factors may include what type of sandwich is ordered (burger or chicken), whether or not fries are also ordered, and the age of the consumer.

---

## Binary Logistic Regression: Odds vs Probabilities

We mentioned that the focus of logistic regression is predicting binary outcomes. More formally, we model __binomially distributed data__ (statistical jargon).

To understand logistic regression, we need to remind ourselves about odds and probability.

---

## Binary Logistic Regression: Odds vs Probabilities

The difference is the denominator. For odds, the math is simple: # of events (e.g., blue) / # of non-events (e.g., not blue).

When interpreting logistic regression as a statistical model, we typically center our results on odds (this will make more sense after you've interpreted a basic logistic regression model by the end of the lecture).

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#images/odds.png" alt="Figure 3. Visual representation of odds vs probabilities." width="70%" />
<p class="caption">Figure 3. Visual representation of odds vs probabilities.</p>
</div>

---

## Binary Logistic Regression: Single Predictor

Figure 4 shows the formal formula for a typical logistic regression model with a single predictor, _which is not a focus of this class_ __(our focus is applied interpretation)__.

When we run a logistic regression model, we are actually modeling the __logarithm of the odds of the outcome/success__, and the logistic regression model is written as follows:

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#images/math.png" alt="Figure 4. Logistic Regression Model." width="70%" />
<p class="caption">Figure 4. Logistic Regression Model.</p>
</div>

---

## Binary Logistic Regression: Single Predictor

For this example, let us work with the `heart_study` data where:

- `heart_attack` is the binary outcome we want to model (1 = had heart attack, 0 = no heart attack)
- `isFemale` is 1 if Female (0 otherwise)
- `age` is the age at the time of data collection.

``` r
library(tidyverse)

heart_study <- read_csv("https://raw.githubusercontent.com/jmtfeliciano/teachingdata/refs/heads/main/Data312Sp2025/heart_attack_data.csv")
head(heart_study)
```

```
## # A tibble: 6 × 3
##   heart_attack isFemale   age
##          <dbl>    <dbl> <dbl>
## 1            1        1    83
## 2            1        0    89
## 3            1        1    99
## 4            0        0    54
## 5            1        0    83
## 6            0        0    60
```

---

## Binary Logistic Regression: isFemale as a predictor

__Question you may want to ask if you are a researcher:__ Is `isFemale` a predictor of `heart_attack`? In public health language, is being female a risk factor for heart attack (myocardial infarction)?

Given the outcome is binary, this is a perfect place to use logistic regression to properly __quantify__ the relationship between the variables involved.

---

## Binary Logistic Regression: isFemale as a predictor

To run a logistic regression model in R, we need to use `glm()`. For the `formula` argument, you may think of the `~` symbol as an equal sign in equations.
Make sure the outcome variable (the binary variable) is to the left of the `~` symbol. Specify the data frame you want to use via the `data` argument, and set the `family` argument to "binomial" to specify that you want logistic regression.

__We will dissect this model on the next page.__

.tiny[

``` r
# creates the logistic regression model
model <- glm(formula = heart_attack ~ isFemale, data = heart_study, family = "binomial")

# summarizes/quantifies the model
summary(model)
```

```
## 
## Call:
## glm(formula = heart_attack ~ isFemale, family = "binomial", data = heart_study)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  -1.0245     0.3116  -3.288  0.00101 **
## isFemale      1.3246     0.4291   3.087  0.00202 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 135.37  on 99  degrees of freedom
## Residual deviance: 125.31  on 98  degrees of freedom
## AIC: 129.31
## 
## Number of Fisher Scoring iterations: 4
```

]

---

## Binary Logistic Regression: isFemale as a predictor

Pay close attention to `isFemale`. In particular, please note the numbers under `Estimate` (1.3246) and `Pr(>|z|)` (0.00202). We will interpret this model next page!

.tiny[

``` r
# creates the logistic regression model
model <- glm(formula = heart_attack ~ isFemale, data = heart_study, family = "binomial")

# summarizes/quantifies the model
summary(model)
```

```
## 
## Call:
## glm(formula = heart_attack ~ isFemale, family = "binomial", data = heart_study)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  -1.0245     0.3116  -3.288  0.00101 **
## isFemale      1.3246     0.4291   3.087  0.00202 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 135.37  on 99  degrees of freedom
## Residual deviance: 125.31  on 98  degrees of freedom
## AIC: 129.31
## 
## Number of Fisher Scoring iterations: 4
```

]

---

## Binary Logistic Regression: isFemale as a predictor

__Last slide/above:__ Pay close attention to `isFemale`. In particular, please note the numbers under `Estimate` (1.3246) and `Pr(>|z|)` (0.00202).

__First question to ask here:__ Is `isFemale` a statistically significant predictor of heart attack?

Yes, because `Pr(>|z|)`, which indicates the P-value, is lower than 0.05 (see the formal hypothesis testing framework earlier).

---

## Quick reminder of an earlier slide:

When we run a logistic regression model, we are actually modeling the __logarithm of the odds of the outcome/success__, and the logistic regression model is written as follows:

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#images/math.png" alt="Figure 4. Logistic Regression Model." width="70%" />
<p class="caption">Figure 4. Logistic Regression Model.</p>
</div>

---

## Binary Logistic Regression: isFemale as a predictor

__How to quantify the impact of isFemale:__ Reminder: `Estimate` (1.3246) and `Pr(>|z|)` (0.00202).

1.3246 is on the log (log-odds) scale, so we have to exponentiate this value first: `\(e^{1.3246}\)`. In R, you do this by running this script:

``` r
exp(1.3246)
```

```
## [1] 3.760681
```

The exponentiated estimate is called an odds ratio (OR):

- If `exp(estimate)` > 1, the variable increases the odds of the outcome.
- If `exp(estimate)` < 1, the variable decreases the odds of the outcome.
- If OR = 1: No association between predictor and outcome.

Since exp(1.3246) > 1 (exact value of 3.760681), `isFemale` is associated with higher odds of heart attack.

__More specific interpretation:__ The odds of having a heart attack are approximately 3.76 times greater for females than for non-females.
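---

## Optional: odds, probabilities, and odds ratios in R

To make the odds-vs-probability discussion and the exponentiation step concrete, here is a minimal sketch. The helper functions `prob_to_odds` and `odds_to_prob` are defined here for illustration; they are not from any package.

``` r
# Helper functions (defined here for illustration, not from a package):
prob_to_odds <- function(p) p / (1 - p)          # events per non-event
odds_to_prob <- function(odds) odds / (1 + odds) # events / (events + non-events)

prob_to_odds(0.75)   # a 75% probability corresponds to odds of 3
## [1] 3

odds_to_prob(3)      # converting back recovers the probability
## [1] 0.75

exp(1.3246)          # the isFemale Estimate exponentiated: the odds ratio
## [1] 3.760681
```

Note how `exp()` turns the log-odds `Estimate` into the odds ratio we interpreted above; `odds_to_prob()` is handy whenever you want to express odds as probabilities instead.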
---

## Binary Logistic Regression: age as a predictor

How about age as a standalone predictor of heart_attack? For now, let us create another model. Again, pay attention to `Estimate` and `Pr(>|z|)`.

.tiny[

``` r
# creates the logistic regression model
model2 <- glm(formula = heart_attack ~ age, data = heart_study, family = "binomial")

# summarizes/quantifies the model
summary(model2)
```

```
## 
## Call:
## glm(formula = heart_attack ~ age, family = "binomial", data = heart_study)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.94222    0.69910  -4.209 2.57e-05 ***
## age          0.04230    0.01054   4.014 5.98e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 135.37  on 99  degrees of freedom
## Residual deviance: 115.78  on 98  degrees of freedom
## AIC: 119.78
## 
## Number of Fisher Scoring iterations: 4
```

]

---

## Binary Logistic Regression: age as a predictor

For age, `Estimate` is 0.04230 and `Pr(>|z|)` is 5.98e-05. Note: 5.98e-05 is equivalent to `\(5.98*10^{-5}\)` (a really small number).

__First question to ask here:__ Is `age` a statistically significant predictor of heart attack? Yes, because `Pr(>|z|)`, which indicates the P-value, is lower than 0.05 (see the formal hypothesis testing framework earlier).

``` r
exp(0.04230)
```

```
## [1] 1.043207
```

Again, since the exponentiated value (1.04) > 1, age is associated with higher odds of heart attack. Since age is a continuous variable, the interpretation is slightly different: for every 1-year increase in age, the odds of having a heart attack increase by approximately 4.32% (formula: (exp(Estimate) - 1) * 100).

``` r
(exp(0.04230) - 1) * 100
```

```
## [1] 4.320739
```

---

## Binary Logistic Regression: isFemale and age as predictors

The downside to running separate models like we did earlier (one model for isFemale only, the other for age only) is that we do not model the simultaneous influence of both variables on the outcome (heart attack). It is as if we are ignoring the effects of other variables on the outcome.

In the real world, we typically model several predictors simultaneously by leveraging the plus sign inside `formula` (e.g., `formula = heart_attack ~ isFemale + age`):

.tiny[

``` r
# creates the logistic regression model
final_model <- glm(formula = heart_attack ~ isFemale + age, data = heart_study, family = "binomial")

# summarizes/quantifies the model
summary(final_model)
```

]

---

### Binary Logistic Regression: multiple predictors

.tiny[

``` r
# creates the logistic regression model
final_model <- glm(formula = heart_attack ~ isFemale + age, data = heart_study, family = "binomial")

# summarizes/quantifies the model
summary(final_model)
```

```
## 
## Call:
## glm(formula = heart_attack ~ isFemale + age, family = "binomial", 
##     data = heart_study)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.14819    0.89030  -4.659 3.17e-06 ***
## isFemale     1.66693    0.50884   3.276  0.00105 ** 
## age          0.04828    0.01170   4.125 3.71e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 135.37  on 99  degrees of freedom
## Residual deviance: 103.62  on 97  degrees of freedom
## AIC: 109.62
## 
## Number of Fisher Scoring iterations: 4
```

]

---

### Binary Logistic Regression: multiple predictors

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#images/final_model.png" alt="Figure 5. Logistic Regression Model." width="80%" />
<p class="caption">Figure 5. Logistic Regression Model.</p>
</div>

In our final model, which factors in both variables and their impact on the risk of heart attack, both remain statistically significant predictors of heart attack (since both P-values are lower than 0.05).

.tiny[

``` r
# exponentiated estimate for female status
# note: standalone model value was only 3.760681 vs 5.295884 now
exp(1.66693)
```

```
## [1] 5.295884
```

``` r
# exponentiated estimate for age
# note: standalone model value was only 1.043207 vs 1.049464 now
exp(0.04828)
```

```
## [1] 1.049464
```

]

---

### Binary Logistic Regression: multiple predictors

``` r
# exponentiated estimate for female status
# note: standalone model value was only 3.760681 vs 5.295884 now
exp(1.66693)
```

```
## [1] 5.295884
```

``` r
# exponentiated estimate for age
# note: standalone model value was only 1.043207 vs 1.049464 now
exp(0.04828)
```

```
## [1] 1.049464
```

``` r
(exp(0.04828) - 1) * 100
```

```
## [1] 4.946446
```

__Interpretations:__ For the most part, the interpretation is the same. But since we are modeling the impacts simultaneously, we normally add "after controlling for [other variable(s)]". See the interpretation on the next page.

---

### Binary Logistic Regression: multiple predictors

``` r
# exponentiated estimate for female status
# note: standalone model value was only 3.760681 vs 5.295884 now
exp(1.66693)
```

```
## [1] 5.295884
```

``` r
# exponentiated estimate for age
# note: standalone model value was only 1.043207 vs 1.049464 now
exp(0.04828)
```

```
## [1] 1.049464
```

``` r
(exp(0.04828) - 1) * 100
```

```
## [1] 4.946446
```

__Interpretations:__

After controlling for sex, we estimate that for every 1-year increase in age, the odds of having a heart attack increase by approximately 4.95%.

After controlling for age, we estimate the odds of having a heart attack are approximately 5.30 times greater for females than for non-females.

---

### Homework Note

There won't be a new homework later this week.
So the homework I will assign next week will be longer.

We don't have a lab this week. However, I will give you access to one of the data sets I will assign for homework (breast_cancer_data.csv), which is under Week 7 on Canvas. There is also a short document (Breast cancer data descriptors.docx) that describes the data columns/variables included.

Part of the future homework will ask you to run a logistic regression model of your choosing to answer any question you may want answered. Think like a scientist!

If you want to practice today, feel free to use the data set to get a head start. You may pick any variable(s) of your choosing!

Otherwise, if you want to leave early today, see you next week!
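---

### Optional: a head start on the homework

If you want to start practicing the workflow now, here is a minimal template. The column names `diagnosis` and `age` below are hypothetical placeholders, and the data is simulated so the template runs on its own; swap in the real variables documented in Breast cancer data descriptors.docx and load breast_cancer_data.csv instead.

``` r
# Simulated stand-in data; for the homework, replace this with
# read_csv("breast_cancer_data.csv") and the real column names.
set.seed(1)
demo <- data.frame(
  diagnosis = rbinom(100, 1, 0.4),       # hypothetical binary outcome
  age       = round(runif(100, 30, 80))  # hypothetical continuous predictor
)

# Same glm() pattern we used for heart_study:
demo_model <- glm(formula = diagnosis ~ age, data = demo, family = "binomial")
summary(demo_model)

# Exponentiate the estimates to interpret them as odds ratios:
exp(coef(demo_model))
```

The workflow is identical to the heart attack examples: fit with `glm(..., family = "binomial")`, check `Pr(>|z|)`, then exponentiate the `Estimate` values.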